Abstract — Relationships between entities in datasets are often
of multiple types, like geographical distance, social relationships,
or common interests among people in a social network, for 
example. This information can naturally be modeled by a set 
of weighted and undirected graphs that form a global multi- 
layer graph, where the common vertex set represents the entities 
and the edges on different layers capture the similarities of the 
entities in terms of the different modalities. In this paper, we
address the problem of analyzing multi-layer graphs and propose 
methods for clustering the vertices by efficiently merging the 
information provided by the multiple modalities. To this end, we 
propose to combine the characteristics of individual graph layers 
using tools from subspace analysis on a Grassmann manifold. The 
resulting combination can then be viewed as a low dimensional 
representation of the original data which preserves the most im- 
portant information from diverse relationships between entities. 
We use this information in new clustering methods and test our 
algorithm on several synthetic and real world datasets where we 
demonstrate superior or competitive performances compared to 
baseline and state-of-the-art techniques. Our generic framework 
further extends to numerous analysis and learning problems that 
involve different types of information on graphs. 

Index Terms — Multi-layer graphs, subspace representation, 
Grassmann manifold, clustering. 

I. Introduction 

GRAPHS are powerful mathematical tools for modeling 
pairwise relationships among sets of entities; they can 
be used for various analysis tasks such as classification or 
clustering. Traditionally, a graph captures a single form of 
relationships between entities and data are analyzed in light 
of this one-layer graph. However, numerous emerging appli- 
cations rely on different forms of information to characterize 
relationships between entities. Diverse examples include hu- 
man interactions in a social network or similarities between 
images or videos in multimedia applications. The multimodal 
nature of the relationships can naturally be represented by a 
set of weighted and undirected graphs that share a common 
set of vertices but with different edge weights depending on 
the type of information in each graph. This can then be repre- 
sented by a multi-layer or multi-view graph which gathers all 
sources of information in a unique representation. Assuming 
that all the graph layers are informative, they are likely to 
provide complementary information and thus to offer richer 

Fig. 1. (a) An illustration of a three-layer graph $G$, whose three layers
$\{G_i\}_{i=1}^{3}$ share the same set of vertices but with different edges. (b) A
potential unified clustering $\{C_k\}_{k=1}^{3}$ of the vertices based on the information
provided by the three layers.

information than any single layer taken in isolation. We thus 
expect that a proper combination of the information contained 
in the different layers leads to an improved understanding of 
the structure of the data and the relationships between entities 
in the dataset. 

In this paper, we consider an $M$-layer graph $G$ with individual
graph layers $G_i = \{V, E_i, \omega_i\}$, $i = 1, \ldots, M$, where $V$
represents the common vertex set and $E_i$ represents the edge
set in the $i$-th individual graph $G_i$ with associated edge weights
$\omega_i$. An example of a three-layer graph is shown in Fig. 1(a),
where the three graph layers share the same set of 12 vertices 
but with different edges (we assume unit edge weights for 
the sake of simplicity). Clearly, different graph layers capture 
different types of relationships between the vertices, and our 
objective is to find a method that properly combines the 
information in these different layers. We first adopt a subspace 
representation for the information provided by the individual 
graph layers, which is inspired by the spectral clustering 
algorithms [1], [2], [3]. We then propose a novel method
for combining the multiple subspace representations into one 
representative subspace. Specifically, we model each graph 
layer as a subspace on a Grassmann manifold. The problem 
of combining multiple graph layers is then transformed into 
the problem of efficiently merging different subspaces on a 
Grassmann manifold. To this end, we study the distances 
between the subspaces and develop a new framework to merge 
the subspaces where the overall distance between the repre- 
sentative subspace and the individual subspaces is minimized. 
We further show that our framework is well justified by results 
from statistical learning theory [4], [5]. The proposed method
is a dimensionality reduction algorithm for the original data;
it leads to a summarization of the information contained in the
multiple graph layers, which reveals the intrinsic relationships 
between the vertices in the multi-layer graph. 

Various learning problems can then be solved using these 
relationships, such as classification or clustering. Specifically, 
we focus in this paper on the clustering problem: we want 
to find a unified clustering of the vertices (as illustrated in 
Fig. 1(b)) by utilizing the representative subspace, such that
it is better than clustering achieved on any of the graph 
layers Gi independently. To address this problem, we first 
apply our generic framework of subspace analysis on the 
Grassmann manifold to compute a meaningful summarization 
(as a representative subspace) of information contained in 
the individual graph layers. We then implement a spectral 
clustering algorithm based on the representative subspace. Ex- 
periments on synthetic and real world datasets demonstrate the 
advantages of our approach compared to baseline algorithms, 
like the summation of individual graphs [6], as well as state-
of-the-art techniques, such as co-regularization [7]. Finally, we
believe that our framework is beneficial not only to clustering, 
but also to many other data processing tasks based on multi- 
layer graphs or multi-view data in general. 

This paper is organized as follows. We first review the
related work and summarize the contribution of the paper in
Section II. In Section III, we describe the subspace representation
inspired by spectral clustering, which captures the
characteristics of a single graph. In Section IV, we review
the main ingredients of Grassmann manifold theory, and
propose a new framework for combining information from
multiple graph layers. We then propose our novel algorithm for
clustering on multi-layer graphs in Section V, and compare its
performance with other clustering methods on multiple graphs
in Section VI. Finally, we conclude in Section VII.

II. Related work 

In this section we review the related work in the literature. 
First, we describe briefly graph-based clustering algorithms, 
with a particular focus on the methods that have subspace 
interpretations. Second, we summarize the previous works 
built upon subspace analysis and the Grassmann manifold 
theory. Finally, we report recent progress in the field
of analysis of multi-layer graphs or multi-view data. 

Clustering on graphs has been studied extensively due to its 
numerous applications in different domains. The works in [8],
[9] have given comprehensive overviews of the advancements
in this field over the last few decades. The algorithms that
are based on spectral techniques on graphs are of particular
interest, typical examples being spectral clustering [1], [2], [3]
and modularity maximization via spectral method [10], [11].
Specifically, these approaches propose to embed the vertices of 
the original graph into a low dimensional space, usually called 
the spectral embedding, which consists of the top eigenvectors 
of a special matrix (graph Laplacian matrix for spectral cluster- 
ing and modularity matrix for modularity maximization). Due 
to the special properties of these matrices, clustering in such 
low dimensional spaces usually becomes trivial. Therefore, 
the corresponding clustering approaches can be interpreted
as transforming the information on the original graph into
a meaningful subspace representation. Another example is
the Principal Component Analysis (PCA) interpretation on
graphs described in [12], which links the graph structure to
a subspace spanned by the top eigenvectors of the graph
Laplacian matrix. These works have inspired us to consider
the subspace representation in Section III.



In the past few decades, subspace-based methods have 
been widely used in classification and clustering problems, 
most notably in image processing and computer vision. In 
[13], [14], the authors have discovered that human faces can
be characterized by low-dimensional subspaces. In [15], the
authors have proposed to use the so-called "eigenfaces" for
recognition. Inspired by these works, researchers have been
particularly interested in data where data points of the same
pattern can be represented by a subspace. Due to the growing
interest in this field, there is an increasingly large number
of works that use tools from the Grassmann manifold theory,
which provides a natural tool for subspace analysis. In [16],
the authors have given a detailed overview of the basics of the
Grassmann manifold theory, and developed new optimization
techniques on the Grassmann manifold. In [17], the author has
presented statistical analysis on the Grassmann manifold. Both
works study the distances on the Grassmann manifold. In [18],
[4], the authors have proposed learning frameworks based on
distance analysis and positive semidefinite kernels defined on
the Grassmann manifold. Other recent representative works
include the studies in [19], [20], where the authors have proposed
to find optimal subspace representation via optimization
on the Grassmann manifold, and the analysis in [21], where
the authors have presented statistical methods on the Stiefel
and Grassmann manifolds for applications in vision. Similarly,
the work in [22] has proposed a novel discriminant analysis
framework based on graph embedding for set matching, and
the authors in [23] have presented a subspace indexing model
on the Grassmann manifold for classification. However, none
of the above works considers datasets represented by multi- 
layer graphs. 

At the same time, multi-view data have attracted a large 
amount of interest in the learning research communities. 
These data form multi-layer graph representations (or multi- 
view representations), which generally refer to data that can 
be analyzed from different viewpoints. In this setting, the 
key challenge is to combine efficiently the information from 
multiple graphs (or multiple views) for learning purposes. The 
existing techniques can be roughly grouped into the following 
categories. First, the most straightforward way is to form a 
convex combination of the information from the individual 
graphs. For example, in [24], the authors have developed a
method to learn an optimal convex combination of Laplacian
kernels from different graphs. In [25], the authors have proposed
a Markov mixture model, which corresponds to a convex
combination of the normalized adjacency matrices of the
individual graphs, for supervised and unsupervised learning. In
[26], the authors have presented several averaging techniques
for combining information from the individual graphs for
clustering. Second, following the intuitive approaches in the
first category, many existing works aim at finding a unified
representation of the multiple graphs (or multiple views), but
using more sophisticated methods. For instance, the authors
in [6], [27], [28], [29], [30], [31] have developed several
joint matrix factorization approaches to combine different
views of data through a unified optimization framework, while
the authors in [32] have proposed to find a unified spectral
embedding of the original data by integrating information
from different views. Similarly, clustering algorithms based
on Canonical Correlation Analysis (CCA) first project the data
from different views into a unified low dimensional subspace,
and then apply simple algorithms like single linkage or k-means
to achieve the final clustering [33], [34]. Third, unlike
the previous methods that try to find a unified representation
before applying learning techniques, another strategy in
the literature is to integrate the information from individual
graphs (views) directly into the optimization problems for
the learning purposes. Examples include the co-EM clustering
algorithm proposed in [35], and the clustering approaches
proposed in [36], [7] based on the frameworks of co-training
[37] and co-regularization [38]. Fourth, particularly in the
analysis of multiple graphs, regularization frameworks on
graphs have also been applied. In [39], the authors have
presented a regularization framework over edge weights of
multiple graphs to compute an improved similarity graph
of the vertices (entities). In [29], [40], the authors have
proposed graph regularization frameworks in both vertex and
graph spectral domain to combine individual graph layers.
Finally, other representative approaches include the works in
[41], [39], where the authors have defined additional graph
representations to incorporate information from the original
individual graphs, and the works in [42], [43], [44], [45], where
the authors have proposed ensemble clustering approaches
by integrating clustering results from individual views. From
this perspective, the proposed approach belongs to the second 
category mentioned above, where we first find a representative 
subspace for the information provided by the multi-layer graph 
and then implement the clustering step, or other learning tasks. 
We believe that this type of approaches is intuitive and easily 
understandable, yet still flexible and generic enough to be 
applied to different types of data. 

To summarize, the main differences between the related
work and the contributions proposed in this paper are the following.
First, the research work on Grassmann manifold theory
has been mainly focused on subspace analysis. The subspaces
usually come directly from the data but are not linked to
graph-based learning problems. Our paper makes the explicit
link between subspaces and graphs, and presents a fundamental
and intuitive way of approaching the learning problems
on multi-layer graphs, with the help of subspace analysis on the
Grassmann manifold. Second, we show the link between the
projection distance on the Grassmann manifold [16], [18] and
the empirical estimate of the Hilbert-Schmidt Independence
Criterion (HSIC) [5]. Therefore, together with the results in
[4], we are able to offer a unified view of concepts from three
different perspectives, namely, the projection distance on the
Grassmann manifold, the Kullback-Leibler (K-L) divergence
[46] and the HSIC [5]. This helps to understand better the
key concept of distance measure in subspace analysis. Finally,
using our novel layer merging framework, we provide a simple
yet competitive solution to the problem of clustering on multi- 
layer graphs. We also discuss the influence of the relationships 
between the individual graph layers on the performance of 
the proposed clustering algorithm. We believe that this is 
helpful towards the design of efficient and adaptive learning 
algorithms. 

III. Subspace representation for graphs 

In this section, we describe a subspace representation for 
the information provided by a single graph. The subspace 
representation is inspired by spectral clustering, which studies 
the spectral properties of the graph information for partitioning 
the vertex set of the graph into several distinct subsets. 

Let us consider a weighted and undirected graph $G = \{V, E, \omega\}$,$^{1}$
where $V = \{v_i\}_{i=1}^{n}$ represents the vertex set and
$E$ represents the edge set with associated edge weights $\omega$.
Without loss of generality, we assume that the
graph is connected. The adjacency matrix $W$ of the graph
is a symmetric matrix whose entry $W_{ij}$ represents the edge
weight if there is an edge between vertices $v_i$ and $v_j$, or
0 otherwise. The degree of a vertex is defined as the sum of the
weights of all the edges incident to it in the graph, and the
degree matrix $D$ is defined as the diagonal matrix containing
the degrees of each vertex along its diagonal. The normalized
graph Laplacian matrix $L$ is then defined as:

$$L = D^{-1/2}(D - W)D^{-1/2}. \qquad (1)$$

The graph Laplacian is of broad interest in the study of
spectral graph theory [47]. Among several variants, we use
the normalized graph Laplacian defined in Eq. (1), since its
spectrum (i.e., its set of eigenvalues) always lies between 0 and 2, a
property favorable in comparing different graph layers in the
following sections. We consider now the problem of clustering
the vertices $V = \{v_i\}_{i=1}^{n}$ of $G$ into $k$ distinct subsets such
that the vertices in the same subset are similar, i.e., they
are connected by edges of large weights. This problem can
be efficiently solved by the spectral clustering algorithms.
Specifically, we focus on the algorithm proposed in [2], which
solves the following trace minimization problem:

$$\min_{U \in \mathbb{R}^{n \times k}} \operatorname{tr}(U'LU), \quad \text{s.t. } U'U = I, \qquad (2)$$

where $n$ is the number of vertices in the graph, $k$ is the
target number of clusters, and $(\cdot)'$ denotes the matrix transpose
operator. It can be shown by a version of the Rayleigh-Ritz
theorem [3] that the solution $U$ to the problem of Eq. (2)
contains the first $k$ eigenvectors (which correspond to the
$k$ smallest eigenvalues) of $L$ as columns. The clustering of
the vertices in $G$ is then achieved by applying the k-means
algorithm [48] to the normalized row vectors of the matrix
$U$.$^{2}$ As shown in [3], the behavior of spectral clustering
can be explained theoretically with analogies to several well-known
mathematical problems, such as the normalized graph-cut
problem [1], the random walk process on graphs [49], and
problems in perturbation theory [50], [51]. This algorithm is
summarized in Algorithm 1.

$^{1}$ We use the notation $G$ for a single graph exclusively in this section.

Algorithm 1 Normalized Spectral Clustering [2]
1: Input:
   $W$: the $n \times n$ weighted adjacency matrix of graph $G$
   $k$: target number of clusters
2: Compute the degree matrix $D$ and the normalized graph
   Laplacian matrix $L = D^{-1/2}(D - W)D^{-1/2}$.
3: Let $U \in \mathbb{R}^{n \times k}$ be the matrix containing the first $k$
   eigenvectors $u_1, \ldots, u_k$ of $L$ (solution of (2)). Normalize
   each row of $U$ to get $U_{\mathrm{norm}}$.
4: Let $y_j \in \mathbb{R}^{k}$ ($j = 1, \ldots, n$) be the $j$-th row of $U_{\mathrm{norm}}$.
5: Cluster $y_j$ in $\mathbb{R}^{k}$ into $k$ clusters $C_1, \ldots, C_k$ using the
   k-means algorithm.
6: Output:
   $C_1, \ldots, C_k$: the cluster assignment
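
The steps of Algorithm 1 can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the toy graph built by `three_clique_graph`, and the small deterministic k-means routine (farthest-point initialization followed by Lloyd iterations) are all assumptions introduced here, and the code assumes a graph without isolated vertices.

```python
import numpy as np


def normalized_laplacian(W):
    """Return L = D^{-1/2} (D - W) D^{-1/2} for a symmetric adjacency matrix W."""
    degrees = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degrees)  # assumes every vertex has positive degree
    # D^{-1/2} (D - W) D^{-1/2} equals I - D^{-1/2} W D^{-1/2}
    return np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]


def spectral_embedding(W, k):
    """Columns are the eigenvectors of L for its k smallest eigenvalues (Step 3)."""
    L = normalized_laplacian(W)
    _, eigvecs = np.linalg.eigh(L)  # eigh returns eigenvalues in ascending order
    return eigvecs[:, :k]


def kmeans(Y, k, n_iter=100):
    """Hypothetical minimal k-means: farthest-point start, then Lloyd iterations."""
    centers = [Y[0]]
    for _ in range(1, k):
        d2 = np.min([((Y - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(d2)])
    centers = np.array(centers)
    labels = None
    for _ in range(n_iter):
        d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for c in range(k):
            if np.any(labels == c):  # keep the old center if a cluster is empty
                centers[c] = Y[labels == c].mean(axis=0)
    return labels


def spectral_clustering(W, k):
    """Steps 2 to 5 of Algorithm 1."""
    U = spectral_embedding(W, k)
    U_norm = U / np.linalg.norm(U, axis=1, keepdims=True)  # row normalization
    return kmeans(U_norm, k)


def three_clique_graph(size=5):
    """Toy graph (an assumption): three cliques chained by two single edges."""
    n = 3 * size
    W = np.zeros((n, n))
    for b in range(3):
        W[b * size:(b + 1) * size, b * size:(b + 1) * size] = 1.0
    np.fill_diagonal(W, 0.0)
    for u, v in [(size - 1, size), (2 * size - 1, 2 * size)]:
        W[u, v] = W[v, u] = 1.0
    return W


labels = spectral_clustering(three_clique_graph(), k=3)
print(labels)
```

On this toy graph each clique is expected to receive its own label; the label values themselves are arbitrary. In practice a library k-means with several random restarts would be a more robust choice than the minimal routine above.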



We provide an illustrative example of the spectral clustering
algorithm. Consider a single graph in Fig. 2(a) with ten
vertices that belong to three distinct clusters (i.e., $n = 10$ and
$k = 3$). For the sake of simplicity, all the edge weights are set
to 1. The low dimensional matrix $U$ that solves the problem
of Eq. (2), which contains $k$ orthonormal eigenvectors of the
graph Laplacian $L$ as columns, is shown in Fig. 2(b). The
matrix $U$ is usually called the spectral embedding of the
vertices, as each row of $U$ can be viewed as the set of coordinates
of the corresponding vertex in the $k$-dimensional space.
More importantly, due to the properties of the graph Laplacian
matrix, such an embedding preserves the connectivity of the
vertices in the original graph. In other words, two vertices that
are strongly connected in the graph are mapped to two vectors
(i.e., rows of $U$) that are also close in the $k$-dimensional space.
As a result, a simple k-means algorithm can be applied to the
normalized row vectors of $U$ to achieve the final clustering of
the vertices.

Inspired by the spectral clustering theory, one can define 
a meaningful subspace representation of the original vertices 
in a graph by its $k$-dimensional spectral embedding, which is
driven by the matrix U built on the first k eigenvectors of 
the graph Laplacian L. Each row being the coordinates of the 
corresponding vertex in the low dimensional subspace, this 
representation contains the information on the connectivity 
of the vertices in the original graph. Such information can 
be used for finding clusters of the vertices, as shown above, 
but it is also useful for other analysis tasks on graphs. By 
adopting this subspace representation that "summarizes" the 
graph information, multiple graph layers can naturally be 
represented by multiple such subspaces (whose geometrical 
relationships can be quite flexible). The task of multi-layer 
graph analysis can then be transformed into the problem of 
effective combination of the multiple subspaces. This is the 
focus of the next section. 

Fig. 2. An illustration of spectral clustering. (a) A graph with three clusters
(color-coded) of vertices; (b) Spectral embedding of the vertices computed
from the graph Laplacian matrix. The vertices in the same cluster are mapped
to coordinates that are close to each other in $\mathbb{R}^{3}$.



IV. Merging subspaces via analysis on the 
Grassmann manifold 

We have described above the subspace representation for 
each graph layer in the multi-layer graph. We discuss now 
the problem of effectively combining multiple graph layers 
by merging multiple subspaces. The theory of Grassmann 
manifold provides a natural framework for such a problem. 
In this section, we first review the main ingredients of the 
Grassmann manifold theory, and then move onto our generic 
framework for merging subspaces. 

A. Ingredients of Grassmann manifold theory 

By definition, a Grassmann manifold $\mathcal{G}(k, n)$ is the set of
$k$-dimensional linear subspaces in $\mathbb{R}^{n}$, where each unique
subspace is mapped to a unique point on the manifold. As
an example, Fig. 3 shows two 2-dimensional subspaces in
$\mathbb{R}^{3}$ being mapped to two points on $\mathcal{G}(2, 3)$. The advantage
of using tools from Grassmann manifold theory is thus two-fold:
(i) it provides a natural representation for our problem:
the subspaces representing the individual graph layers can be
considered as different points$^{3}$ on the Grassmann manifold;
(ii) the analysis on the Grassmann manifold permits the use of
efficient tools to study the distances between points on the
manifold, namely, distances between different subspaces. Such
distances play an important role in the problem of merging
the information from multiple graph layers. In what follows,
we focus on the definition of one particular distance measure
between subspaces, which will be used in our framework later
on.

Mathematically speaking, each point on $\mathcal{G}(k, n)$ can be
represented by an orthonormal matrix $Y \in \mathbb{R}^{n \times k}$ whose
columns span the corresponding $k$-dimensional subspace in
$\mathbb{R}^{n}$; it is thus denoted as $\mathrm{span}(Y)$. For example, the two
subspaces shown in Fig. 3 can be denoted as $\mathrm{span}(Y_1)$
and $\mathrm{span}(Y_2)$ for two orthonormal matrices $Y_1$ and $Y_2$. The
distance between two points on the manifold, or between two
subspaces $\mathrm{span}(Y_1)$ and $\mathrm{span}(Y_2)$, is then defined based on a
set of principal angles $\{\theta_i\}_{i=1}^{k}$ between these subspaces [52].
These principal angles, which measure how geometrically close
the subspaces are, are the fundamental measures used to
define various distances on the Grassmann manifold, such as


$^{2}$ The necessity for row normalization is discussed in [3] and we omit this
discussion here. However, the normalization does not change the nature of
spectral embedding, hence, it does not affect our derivation later.

$^{3}$ We assume that the Laplacian matrices of any pair of layers in
the multi-layer graph have different sets of top eigenvectors. In this case,
subspace representations for all the layers will be different from each other.

Fig. 3. An example of two 2-dimensional subspaces $\mathrm{span}(Y_1)$ and
$\mathrm{span}(Y_2)$ in $\mathbb{R}^{3}$, which are mapped to two points on the Grassmann manifold
$\mathcal{G}(2, 3)$.

the Riemannian (geodesic) distance or the projection distance
[16], [18]. In this paper, we use the projection distance, which
is defined as:

$$d_{\mathrm{proj}}(Y_1, Y_2) = \Big(\sum_{i=1}^{k} \sin^2\theta_i\Big)^{1/2}, \qquad (3)$$

where $Y_1$ and $Y_2$ are the orthonormal matrices representing
the two subspaces under comparison.$^{4}$ The reason for
choosing the projection distance is two-fold: (i) the projection
distance is defined as the $\ell^2$-norm of the vector of sines of
the principal angles. Since it uses all the principal angles,
it is therefore an unbiased definition. This is favorable as
we do not assume any prior knowledge on the distribution
of the data, and all the principal angles are considered to
carry meaningful information; (ii) the projection distance can
be interpreted using a one-to-one mapping that preserves
distinctness: $\mathrm{span}(Y) \to YY' \in \mathbb{R}^{n \times n}$. Note that the squared
projection distance can be rewritten as:

$$
\begin{aligned}
d_{\mathrm{proj}}^2(Y_1, Y_2) &= \sum_{i=1}^{k} \sin^2\theta_i \\
&= k - \sum_{i=1}^{k} \cos^2\theta_i \\
&= k - \operatorname{tr}(Y_1Y_1'Y_2Y_2') \\
&= \frac{1}{2}\left[2k - 2\operatorname{tr}(Y_1Y_1'Y_2Y_2')\right] \\
&= \frac{1}{2}\left[\operatorname{tr}(Y_1'Y_1) + \operatorname{tr}(Y_2'Y_2) - 2\operatorname{tr}(Y_1Y_1'Y_2Y_2')\right] \\
&= \frac{1}{2}\|Y_1Y_1' - Y_2Y_2'\|_F^2, \qquad (4)
\end{aligned}
$$

where the third equality comes from the definition of the
principal angles and the fifth equality uses the fact that
$Y_1$ and $Y_2$ are orthonormal matrices. It can be seen from
Eq. (4) that the projection distance can be related to the
Frobenius norm of the difference between the mappings of
the two subspaces $\mathrm{span}(Y_1)$ and $\mathrm{span}(Y_2)$ in $\mathbb{R}^{n \times n}$. Because
the mapping preserves distinctness, it is natural to take the
projection distance as a proper distance measure between
subspaces. Moreover, the third equality of Eq. (4) provides an
explicit way of computing the projection distance between two
subspaces from their matrix representations $Y_1$ and $Y_2$. We are
going to use it in developing the generic merging framework
in the following section.

$^{4}$ In the special case where $Y_1$ and $Y_2$ represent the same subspace, we have
$d_{\mathrm{proj}}(Y_1, Y_2) = 0$.

To summarize, the Grassmann manifold provides a natural 
and intuitive representation for subspace-based analysis (as 
shown in Fig. 3). The associated tools, namely the principal
angles, make it possible to define a meaningful distance measure that
captures the geometric relationships between the subspaces. 
Originally defined as a distance measure between two sub- 
spaces, the projection distance can be naturally generalized to 
the analysis of multiple subspaces, as we show in the next 
section. 
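
As a numerical illustration of the equivalent forms of the squared projection distance discussed in this subsection, the following sketch computes it in three ways: from the principal angles, from the trace expression, and from the Frobenius norm of the difference of the projection matrices. The helper names and the random test subspaces are assumptions made for this example; the principal angles are obtained from the singular values of $Y_1'Y_2$, which are the cosines of those angles.

```python
import numpy as np


def random_orthonormal_basis(n, k, rng):
    """Orthonormal n x k matrix spanning a random k-dimensional subspace of R^n."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
    return Q


def principal_angles(Y1, Y2):
    """Principal angles: arccos of the singular values of Y1' Y2."""
    cosines = np.linalg.svd(Y1.T @ Y2, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))


def proj_dist_sq_from_angles(Y1, Y2):
    """Sum of squared sines of the principal angles."""
    return np.sum(np.sin(principal_angles(Y1, Y2)) ** 2)


def proj_dist_sq_from_trace(Y1, Y2):
    """k - tr(Y1 Y1' Y2 Y2')."""
    k = Y1.shape[1]
    return k - np.trace(Y1 @ Y1.T @ Y2 @ Y2.T)


def proj_dist_sq_from_frobenius(Y1, Y2):
    """(1/2) * || Y1 Y1' - Y2 Y2' ||_F^2."""
    return 0.5 * np.linalg.norm(Y1 @ Y1.T - Y2 @ Y2.T, "fro") ** 2


rng = np.random.default_rng(0)
Y1 = random_orthonormal_basis(6, 2, rng)
Y2 = random_orthonormal_basis(6, 2, rng)
print(proj_dist_sq_from_angles(Y1, Y2),
      proj_dist_sq_from_trace(Y1, Y2),
      proj_dist_sq_from_frobenius(Y1, Y2))
```

The three printed values should agree up to floating-point error. Because the distance depends on $Y$ only through $YY'$, replacing $Y_1$ by $Y_1 R$ for any $k \times k$ orthogonal matrix $R$ leaves it unchanged.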

B. Generic merging framework 

Equipped with the subspace representation for individual 
graphs and with a distance measure to compare different 
subspaces, we are now ready to present our generic framework 
for merging the information from multiple graph layers. Given 
a multi-layer graph $G$ with $M$ individual layers $\{G_i\}_{i=1}^{M}$,
we first compute the graph Laplacian matrix $L_i$ for each
$G_i$ and then represent each $G_i$ by the spectral embedding
matrix $U_i \in \mathbb{R}^{n \times k}$ from the first $k$ eigenvectors of $L_i$, where
$n$ is the number of vertices and $k$ is the target number of
clusters. Recall that each of the matrices $\{U_i\}_{i=1}^{M}$ defines
a $k$-dimensional subspace in $\mathbb{R}^{n}$, which can be denoted as
$\mathrm{span}(U_i)$. The goal is to merge these multiple subspaces in
a meaningful and efficient way. To this end, our philosophy
is to find a representative subspace $\mathrm{span}(U)$ that is close to
all the individual subspaces $\mathrm{span}(U_i)$, and at the same time
the representation $U$ preserves the vertex connectivity in each
graph layer. For notational convenience, in the rest of the
paper we simply refer to the representations $U$ and $U_i$ as the
corresponding subspaces, unless indicated specifically.

The squared projection distance between subspaces defined
in Eq. (4) can be naturally generalized for analysis of multiple
subspaces. More specifically, we can define the squared
projection distance between the target representative subspace
$U$ and the $M$ individual subspaces $\{U_i\}_{i=1}^{M}$ as the sum of
squared projection distances between $U$ and each individual
subspace given by $U_i$:

$$
\begin{aligned}
d_{\mathrm{proj}}^2(U, \{U_i\}_{i=1}^{M}) &= \sum_{i=1}^{M} d_{\mathrm{proj}}^2(U, U_i) \\
&= \sum_{i=1}^{M} \left[k - \operatorname{tr}(UU'U_iU_i')\right] \\
&= kM - \sum_{i=1}^{M} \operatorname{tr}(UU'U_iU_i'). \qquad (5)
\end{aligned}
$$

The minimization of the distance measure in Eq. (5) forces
the representative subspace $U$ to be close to all the individual
subspaces $\{U_i\}_{i=1}^{M}$ in terms of the projection distance on
the Grassmann manifold. At the same time, we want $U$ to
preserve the vertex connectivity in each graph layer. This
can be achieved by minimizing the Laplacian quadratic form
evaluated on the columns of $U$, as also indicated by the
objective function in Eq. (2) for spectral clustering. Therefore,
we finally propose to merge multiple subspaces by solving the
following optimization problem that integrates Eq. (2) and Eq. (5):

$$\min_{U \in \mathbb{R}^{n \times k}} \sum_{i=1}^{M} \operatorname{tr}(U'L_iU) + \alpha\Big[kM - \sum_{i=1}^{M} \operatorname{tr}(UU'U_iU_i')\Big], \quad \text{s.t. } U'U = I, \qquad (6)$$

where $L_i$ and $U_i$ are the graph Laplacian and the subspace
representation for $G_i$, respectively. The regularization parameter
$\alpha$ balances the trade-off between the two terms in the
objective function.

The problem of Eq. (6) can be solved in a similar manner
as Eq. (2). Specifically, by ignoring constant terms and
rearranging the trace form in the second term of the objective
function (using the cyclic property of the trace,
$\operatorname{tr}(UU'U_iU_i') = \operatorname{tr}(U'U_iU_i'U)$), Eq. (6) can be rewritten as

$$\min_{U \in \mathbb{R}^{n \times k}} \operatorname{tr}\Big[U'\Big(\sum_{i=1}^{M} L_i - \alpha\sum_{i=1}^{M} U_iU_i'\Big)U\Big], \quad \text{s.t. } U'U = I. \qquad (7)$$

It is interesting to note that this is the same trace minimization
problem as in Eq. (2), but with a "modified" Laplacian:

$$L_{\mathrm{mod}} = \sum_{i=1}^{M} L_i - \alpha\sum_{i=1}^{M} U_iU_i'. \qquad (8)$$

Therefore, by the Rayleigh-Ritz theorem, the solution to the
problem of Eq. (7) is given by the first $k$ eigenvectors of
the modified Laplacian $L_{\mathrm{mod}}$, which can be computed using
efficient algorithms for eigenvalue problems [53], [54].
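
The merging step can be sketched as follows: build the modified Laplacian from the per-layer Laplacians and spectral embeddings, then take its eigenvectors for the $k$ smallest eigenvalues. This is a minimal dense-matrix sketch under stated assumptions, not the authors' code: the function names are hypothetical, the random layers are synthetic, and the value `alpha=0.5` is an arbitrary choice made only for the demonstration.

```python
import numpy as np


def normalized_laplacian(W):
    """L = D^{-1/2} (D - W) D^{-1/2}; assumes every vertex has positive degree."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]


def first_k_eigenvectors(A, k):
    """Eigenvectors of a symmetric matrix A for its k smallest eigenvalues."""
    _, eigvecs = np.linalg.eigh(A)  # ascending eigenvalue order
    return eigvecs[:, :k]


def merge_layers(W_layers, k, alpha):
    """Representative subspace U and modified Laplacian of Eq. (8)."""
    n = W_layers[0].shape[0]
    L_mod = np.zeros((n, n))
    for W in W_layers:
        L_i = normalized_laplacian(W)
        U_i = first_k_eigenvectors(L_i, k)      # subspace of layer i
        L_mod += L_i - alpha * (U_i @ U_i.T)    # accumulate L_i - alpha U_i U_i'
    U = first_k_eigenvectors(L_mod, k)          # minimizer of Eq. (7)
    return U, L_mod


def objective(U, L_mod):
    """Trace objective of Eq. (7)."""
    return np.trace(U.T @ L_mod @ U)


# Synthetic three-layer example with dense random symmetric weights (an assumption).
rng = np.random.default_rng(0)
layers = []
for _ in range(3):
    A = rng.random((12, 12))
    W = (A + A.T) / 2.0
    np.fill_diagonal(W, 0.0)
    layers.append(W)

U, L_mod = merge_layers(layers, k=3, alpha=0.5)
print(U.shape, objective(U, L_mod))
```

The rows of the returned `U` can then be row-normalized and clustered with k-means, in the same way as in Algorithm 1. For large sparse graphs, an iterative eigensolver that computes only a few eigenpairs would be preferable to the full decomposition used here.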

In the problem of Eq. (6) we try to find a representative
subspace $U$ from the multiple subspaces $\{U_i\}_{i=1}^{M}$. Such a
representation not only preserves the structural information 
contained in the individual graph layers, which is encouraged 
by the first term of the objective function in Eq. (6), but also
keeps a minimum distance between itself and the multiple 
subspaces, which is enforced by the second term. Notice that 
the minimization of only the first term itself corresponds to 
simple averaging of the information from different graph lay- 
ers, which usually leads to suboptimal clustering performance 
as we shall see in the experimental section. Similarly, imposing 
only a small projection distance to the individual subspaces 
$\{U_i\}_{i=1}^{M}$ does not necessarily guarantee that $U$ is a good
solution for merging the subspaces. In fact, for a given k- 
dimensional subspace, there are infinitely many choices for the 
matrix representation, and not all of them are considered as 
meaningful summarizations of the information provided by the 
multiple graph layers. However, under the additional constraint 
of minimizing the trace of the quadratic term U'LiU over all 
the graphs (which is the first term of the objective function in 
Eq. (6)), the vertex connectivity in the individual graphs tends
to be preserved in U. In this case, the smaller the projection 
distance between U and the individual subspaces, the more 
representative it is for all graph layers. 

C. Discussion of the distance function 

Interestingly, the choice of projection distance as a similarity
measure between subspaces in the optimization problem of
Eq. (6) can be well justified from information-theoretic and
statistical learning points of view. The first justification is
from the work of Hamm et al. [4], in which the authors have
shown that the Kullback-Leibler (K-L) divergence [46], which
is a well-known similarity measure between two probability
distributions in information theory, is closely related to the
squared projection distance. More specifically, the work in
[4] suggests that, under certain conditions, we can consider a
linear subspace $U_i$ as the "flattened" limit of a Factor Analyzer
distribution $p_i$ [55]:

$$p_i : \mathcal{N}(u_i, C_i), \qquad C_i = U_i U_i' + \sigma^2 I_n, \tag{9}$$

where $\mathcal{N}$ stands for the normal distribution, $u_i \in \mathbb{R}^n$ is the
mean, $U_i \in \mathbb{R}^{n \times k}$ is a full-rank matrix with $n > k > 0$ (which
represents the subspace), $\sigma$ is the ambient noise level, and $I_n$
is the identity matrix of dimension n. For two subspaces $U_i$
and $U_j$, the symmetrized K-L divergence between the two
corresponding distributions $p_i$ and $p_j$ can then be rewritten
as:

$$d_{KL}(p_i, p_j) = \frac{1}{2\sigma^2(\sigma^2+1)} \left( 2k - 2\,\mathrm{tr}(U_i U_i' U_j U_j') \right), \tag{10}$$

which is of the same form as the squared projection distance
when we ignore the constant factor (see Eq. (4)). This shows
that, if we take a probabilistic view of the subspace representations
$\{U_i\}_{i=1}^{M}$, then the projection distance between subspaces
can be considered consistent with the K-L divergence.
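The relation in Eq. (10) can be checked numerically. The following is a minimal sketch, assuming zero (equal) means for both distributions, orthonormal columns for each subspace basis, and the sum convention KL(p||q) + KL(q||p) for the symmetrized divergence; the function names are hypothetical and chosen for illustration only.

```python
import numpy as np

def random_subspace(n, k, rng):
    """Return an n x k matrix with orthonormal columns (a basis of a random subspace)."""
    q, _ = np.linalg.qr(rng.standard_normal((n, k)))
    return q

def symmetrized_kl_zero_mean(c1, c2):
    """KL(p1||p2) + KL(p2||p1) for p1 = N(0, c1) and p2 = N(0, c2).

    The log-determinant terms cancel in the symmetrized sum.
    """
    n = c1.shape[0]
    return 0.5 * (np.trace(np.linalg.solve(c2, c1))
                  + np.trace(np.linalg.solve(c1, c2)) - 2 * n)

def kl_from_eq10(u1, u2, sigma):
    """Right-hand side of Eq. (10) for orthonormal bases u1 and u2."""
    k = u1.shape[1]
    overlap = np.trace(u1 @ u1.T @ u2 @ u2.T)
    return (2 * k - 2 * overlap) / (2 * sigma**2 * (sigma**2 + 1))
```

With covariances built as in Eq. (9), the two functions agree up to floating-point error, which is consistent with the claim that the divergence depends on the subspaces only through $\mathrm{tr}(U_i U_i' U_j U_j')$.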

The second justification is from the recently proposed
Hilbert-Schmidt Independence Criterion (HSIC) [5], which
measures the statistical dependence between two random variables.
Given $K_{\mathcal{X}_1}, K_{\mathcal{X}_2} \in \mathbb{R}^{n \times n}$ that are the centered Gram
matrices of some kernel functions defined over two random
variables $\mathcal{X}_1$ and $\mathcal{X}_2$, the empirical estimate of HSIC is given
by

$$d_{HSIC}(\mathcal{X}_1, \mathcal{X}_2) = \mathrm{tr}(K_{\mathcal{X}_1} K_{\mathcal{X}_2}). \tag{11}$$

That is, the larger $d_{HSIC}(\mathcal{X}_1, \mathcal{X}_2)$, the stronger the statistical
dependence between $\mathcal{X}_1$ and $\mathcal{X}_2$. In our case, using the
idea of spectral embedding, we can consider the rows of
the individual subspace representations $U_i$ and $U_j$ as two
particular sets of sample points in $\mathbb{R}^k$, which are drawn from
two probability distributions governed by the information on
vertex connectivity in $G_i$ and $G_j$, respectively. In other words,
the sets of rows of $U_i$ and $U_j$ can be seen as realizations of
two random variables $\mathcal{X}_i$ and $\mathcal{X}_j$. Therefore, we can define the
Gram matrices of linear kernels on $\mathcal{X}_i$ and $\mathcal{X}_j$ as:

$$K_{\mathcal{X}_i} = (U_i')'(U_i') = U_i U_i', \qquad K_{\mathcal{X}_j} = (U_j')'(U_j') = U_j U_j'. \tag{12}$$

By applying Eq. (11), we can see that:

$$d_{HSIC}(\mathcal{X}_i, \mathcal{X}_j) = \mathrm{tr}(K_{\mathcal{X}_i} K_{\mathcal{X}_j}) = \mathrm{tr}(U_i U_i' U_j U_j') = k - d_{proj}^2(U_i, U_j). \tag{13}$$
This shows that the projection distance between subspaces $U_i$
and $U_j$ can be interpreted as the negative dependence between
$\mathcal{X}_i$ and $\mathcal{X}_j$, which reflect the information provided by the two
individual graph layers $G_i$ and $G_j$.

Therefore, from both information-theoretic and statistical
learning points of view, the smaller the projection distance
between two subspace representations $U_i$ and $U_j$, the more
similar the information in the respective graphs that they
represent. As a result, the representative subspace (the solution
U to the problem of Eq. (6)) can be considered as a subspace
representation that "summarizes" the information from the
individual graph layers, and at the same time captures the
intrinsic relationships between the vertices in the graph.
Such relationships are of crucial importance
in our multi-layer graph analysis.
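The identity in Eq. (13) can also be verified numerically. The sketch below assumes the usual definition of the squared projection distance as the sum of squared sines of the principal angles between the two subspaces, and it uses the uncentered linear-kernel Gram matrices exactly as written in Eq. (12); the helper names are hypothetical.

```python
import numpy as np

def orthonormal_basis(n, k, rng):
    """Return an n x k matrix with orthonormal columns."""
    q, _ = np.linalg.qr(rng.standard_normal((n, k)))
    return q

def squared_projection_distance(u_i, u_j):
    """Sum of squared sines of the principal angles between span(u_i) and span(u_j)."""
    # Singular values of u_i' u_j are the cosines of the principal angles.
    cosines = np.linalg.svd(u_i.T @ u_j, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)
    return float(np.sum(1.0 - cosines**2))

def linear_kernel_hsic(u_i, u_j):
    """tr(K_i K_j) with the Gram matrices K = U U' of Eq. (12)."""
    return float(np.trace((u_i @ u_i.T) @ (u_j @ u_j.T)))
```

For any two orthonormal bases with k columns, `linear_kernel_hsic` equals k minus `squared_projection_distance`, so a larger dependence value corresponds to a smaller distance on the Grassmann manifold.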

In summary, the concept of treating individual graphs as
subspaces, or points on the Grassmann manifold, makes it possible
to study the desired merging framework in a unique and
principled way. We are able to find a representative subspace
for the multi-layer graph of interest, which can be viewed as
a dimensionality reduction approach for the original data. We
finally remark that the proposed merging framework can be
easily extended to take into account the relative importance
of each individual graph layer with respect to the specific
learning purpose. For instance, when prior knowledge about
the importance of the information in the individual graphs
is available, we can adapt the value of the regularization
parameter $\alpha$ in Eq. (6) to the different layers such that
the representative subspace is closer to the most informative
subspace representations.
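One hypothetical way to realize this extension is sketched below. It assumes that the modified Laplacian takes the form implied by the two terms of the objective (the sum of the layer Laplacians minus $\alpha$ times the sum of the projection matrices $U_i U_i'$) and simply replaces the single $\alpha$ with one weight per layer. The function name and the weighting scheme are illustrative assumptions, not a procedure given in the text.

```python
import numpy as np

def weighted_modified_laplacian(laplacians, subspaces, alphas):
    """Sum_i (L_i - alpha_i * U_i U_i') with one regularization weight per layer.

    laplacians: list of n x n symmetric matrices L_i
    subspaces:  list of n x k matrices U_i with orthonormal columns
    alphas:     list of non-negative weights, larger for more trusted layers
    """
    total = np.zeros_like(laplacians[0], dtype=float)
    for lap, u, alpha in zip(laplacians, subspaces, alphas):
        total += lap - alpha * (u @ u.T)
    return total
```

Setting all weights equal recovers the single-parameter case, while a larger weight for one layer pulls the leading eigenvectors of the result towards that layer's subspace.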



V. Clustering on multi-layer graphs

In Section IV, we introduced a novel framework for merging
subspace representations from the individual layers of a multi-layer
graph, which leads to a representative subspace that
captures the intrinsic relationships between the vertices of the
graph. This representative subspace provides a low dimensional
form that can be used in several applications involving
multi-layer graph analysis. In particular, we now study one
such application, namely the problem of clustering vertices in
a multi-layer graph. We further analyze the behavior of the
proposed clustering algorithm with respect to the properties
of the individual graph layers (subspaces).

A. Clustering algorithm

As we have already seen in Section III, the success of the
spectral clustering algorithm relies on the transformation of
the information contained in the graph structure into a spectral
embedding computed from the graph Laplacian matrix, where
each row of the embedding matrix (after normalization) is
treated as the coordinates of the corresponding vertex in a
low dimensional subspace. In our problem of clustering on
a multi-layer graph, the setting is slightly different, since we
aim at finding a unified clustering of the vertices that takes
into account information contained in all the individual layers
of the multi-layer graph. However, the merging framework
proposed in the previous section can naturally be applied
in this context. In fact, it leads to a natural solution to the
clustering problem on multi-layer graphs. In more detail,
similarly to the spectral embedding matrix in the spectral
clustering algorithm, which is a subspace representation for
one individual graph, our merging framework provides a
representative subspace that contains the information from the
multiple graph layers. Using this representation, we can then
follow the same steps of spectral clustering to achieve the
final clustering of the vertices with a k-means algorithm. The
proposed clustering algorithm is summarized in Algorithm 2.

Algorithm 2 Spectral Clustering on Multi-Layer graphs (SC-ML)
1: Input:
   $\{W_i\}_{i=1}^{M}$: n × n weighted adjacency matrices of individual graph layers $\{G_i\}_{i=1}^{M}$
   k: target number of clusters
   $\alpha$: regularization parameter
2: Compute the normalized Laplacian matrix $L_i$ and the subspace representation $U_i$ for each $G_i$.
3: Compute the modified Laplacian matrix $L_{\mathrm{mod}} = \sum_{i=1}^{M} L_i - \alpha \sum_{i=1}^{M} U_i U_i'$.
4: Compute $U \in \mathbb{R}^{n \times k}$, the matrix containing the first k eigenvectors $u_1, \ldots, u_k$ of $L_{\mathrm{mod}}$. Normalize each row of U to get $U_{\mathrm{norm}}$.
5: Let $y_j \in \mathbb{R}^k$ (j = 1, ..., n) be the j-th row of $U_{\mathrm{norm}}$.
6: Cluster $y_j$ in $\mathbb{R}^k$ into $C_1, \ldots, C_k$ using the k-means algorithm.
7: Output:
   $C_1, \ldots, C_k$: the cluster assignment



It is clear that Algorithm 2 is a direct generalization of
Algorithm 1 to the case of multi-layer graphs. The main ingredient
of our clustering algorithm is the merging framework
proposed in Section IV, in which information from the individual
graph layers is summarized before the actual clustering
step (i.e., the k-means step) is carried out. This provides
an example that illustrates how our generic merging framework
can be applied to specific learning tasks on multi-layer graphs.
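The steps of Algorithm 2 can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: it assumes the symmetric normalized Laplacian $I - D^{-1/2} W D^{-1/2}$, takes "first k eigenvectors" to mean those of the k smallest eigenvalues, assumes the modified Laplacian is the sum of the layer Laplacians minus $\alpha$ times the sum of $U_i U_i'$, and uses a deliberately simple k-means with farthest-point initialization. All function names are hypothetical.

```python
import numpy as np

def normalized_laplacian(w):
    """L = I - D^{-1/2} W D^{-1/2} for a symmetric non-negative adjacency matrix w."""
    deg = w.sum(axis=1)
    inv_sqrt = np.zeros_like(deg, dtype=float)
    inv_sqrt[deg > 0] = 1.0 / np.sqrt(deg[deg > 0])
    return np.eye(w.shape[0]) - inv_sqrt[:, None] * w * inv_sqrt[None, :]

def first_k_eigenvectors(sym, k):
    """Eigenvectors of a symmetric matrix for its k smallest eigenvalues."""
    _, vecs = np.linalg.eigh(sym)  # eigh returns eigenvalues in ascending order
    return vecs[:, :k]

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd iterations with farthest-point initialization (illustrative only)."""
    rng = np.random.default_rng(seed)
    centers = [points[rng.integers(len(points))]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([points[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def sc_ml(adjacencies, k, alpha):
    """Cluster the common vertex set of a multi-layer graph into k groups."""
    laplacians = [normalized_laplacian(np.asarray(w, dtype=float)) for w in adjacencies]
    subspaces = [first_k_eigenvectors(lap, k) for lap in laplacians]
    # Assumed form of the modified Laplacian (see the lead-in).
    l_mod = sum(laplacians) - alpha * sum(u @ u.T for u in subspaces)
    u = first_k_eigenvectors(l_mod, k)
    norms = np.linalg.norm(u, axis=1, keepdims=True)
    u_norm = u / np.where(norms > 0, norms, 1.0)  # row normalization
    return kmeans(u_norm, k)
```

Calling `sc_ml([w1, w2, w3], k=3, alpha=0.5)` on three n × n adjacency matrices returns one integer label per vertex. For large graphs, a sparse eigensolver would be a more practical choice than the dense `eigh` used here.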

B. Analysis of the proposed algorithm 

We now analyze the behavior of the proposed clustering 
algorithm under different conditions. Specifically, we first 
outline the link between subspace distance and clustering 
quality, and then compare the clustering performances in 
two scenarios where the relationships between the individual 
subspaces $\{U_i\}_{i=1}^{M}$ are different.



As we have seen in Section IV, the rows of the subspace
representations $\{U_i\}_{i=1}^{M}$ can be viewed as realizations of
random variables $\{\mathcal{X}_i\}_{i=1}^{M}$ governed by the graph information.
At the same time, spectral clustering directly utilizes $U_i$ for the
purpose of clustering. Therefore, $\{\mathcal{X}_i\}_{i=1}^{M}$ can be considered
as random variables that control the cluster assignment of the
vertices. In fact, it has been shown in [3] that the matrix $U_i$ is
closely related to the matrix that contains the cluster indicator
vectors as columns. Since the projection distance can be
understood as the negative statistical dependence between such
random variables, the minimization of the projection distance
in Eq. (6) is equivalent to the maximization of the dependence
between the random variable from the representative subspace
U and the ones from the individual subspaces $\{U_i\}_{i=1}^{M}$. The
optimization in Eq. (6) can then be seen as a solution that tends
to produce a clustering with the representative subspace that is
consistent with those computed from the individual subspace
representations.

Fig. 4. A 3-layer graph with unit edge weights for toy example 1. The colors indicate the groundtruth clusters.

Fig. 5. A 3-layer graph with unit edge weights for toy example 2. The colors indicate the groundtruth clusters.

TABLE I
Analysis of toy example 1.

(a) clustering performances for toy example 1

          layer G1    layer G2    layer G3    SC-ML
NMI       0.6279      0.6181      0.2673      1.0000

(b) subspace distances for toy example 1

            layer G1    layer G2    layer G3    subspace computed by SC-ML
layer G1    -           1.1100      1.3670      0.9456
layer G2    1.1100      -           1.3354      1.0452
layer G3    1.3670      1.3354      -           1.0788

TABLE II
Analysis of toy example 2.

(a) clustering performances for toy example 2

          layer G1    layer G2    layer G3    SC-ML
NMI       0.7934      0.2673      0.4728      0.5300

(b) subspace distances for toy example 2

            layer G1    layer G2    layer G3    subspace computed by SC-ML
layer G1    -           1.3098      1.2296      1.0311
layer G2    1.3098      -           0.9343      0.8828
layer G3    1.2296      0.9343      -           0.5058

We now discuss how the relationships between the individual
subspaces may affect the performance of our clustering
algorithm SC-ML. Intuitively, since the second term of the
objective function in Eq. (6) represents the distance between
the representative subspace U and all the individual subspaces
$\{U_i\}_{i=1}^{M}$, it tends to drive the solution towards those subspaces
that are themselves close to each other on the Grassmann
manifold. To show this more clearly, let us consider two toy
examples. The first example is illustrated in Fig. 4, where we
have a 3-layer graph with the individual layers $G_1$, $G_2$ and $G_3$
sharing the same set of vertices. For the sake of simplicity, all
the edge weights are set to one. In addition, three groundtruth
clusters are indicated by the colors of the vertices. Table I(a)
shows the performances of Algorithm 1 with individual layers
as well as Algorithm 2^ for the multi-layer graph, in terms
of Normalized Mutual Information (NMI) [56] with respect
to the groundtruth clusters. Table I(b) shows the projection
distances between various pairs of subspaces. It is clear that
the layers $G_1$ and $G_2$ produce better clustering quality, and that
the distance between the corresponding subspaces is smaller.
However, the vertex connectivity in layer $G_3$ is not very
consistent with the groundtruth clusters and the corresponding
subspace is further away from the ones from $G_1$ and $G_2$. In
this case, the solution found by SC-ML is driven to be
close to the consistent subspaces from $G_1$ and $G_2$, and hence
provides satisfactory clustering results (NMI = 1 represents
perfect recovery of groundtruth clusters). Let us now consider
a second toy example, as illustrated in Fig. 5. In this example
we have two layers $G_2$ and $G_3$ with relatively low quality
information with respect to the groundtruth clustering of
the vertices. As we see in Table II(b), their corresponding
subspaces are close to each other on the Grassmann manifold.
The most informative layer $G_1$, however, represents a subspace
that is quite far away from the ones from $G_2$ and $G_3$. At the
same time, we see in Table II(a) that the clustering results are
better for the first layer than for the other two less informative
layers. If the quality of the information in the different layers
is not considered in computing the representative subspace,
SC-ML drives the solution to be closer to the two layers
of relatively lower quality, which results in unsatisfactory
clustering performance in this case.

The analysis above implies that the proposed clustering 
algorithm works well under the following assumptions: (i) the 
majority of the individual subspaces are relatively informa- 
tive, namely, they are helpful for recovering the groundtruth 
clustering, and (ii) they are reasonably close to each other on 
the Grassmann manifold, namely, they provide complementary 
but not contradictory information. These are the assumptions 
made in the present work. As we shall see in the next section, 
these assumptions seem to be appropriate and realistic in real 
world datasets. If this is not the case, one may assume that
a preprocessing step cleans the datasets, or at least provides 
information about the reliability of the information in the 
different graph layers. 

VI. Experimental results 

In this section, we evaluate the performance of the SC-ML
algorithm presented in Section V on several synthetic
and real world datasets. We first describe the datasets that we
use for the evaluation, and then explain the various clustering 
algorithms that we adopt in the performance comparisons. We 
finally present the results in terms of three evaluation criteria 
as well as some discussions. 

^We choose the value of the regularization parameter that leads to the best
possible clustering performance. More discussions about the choices of this
parameter are presented in Section VI.

A. Datasets

We adopt one synthetic and two real world datasets with 
multi-layer graph representation for the evaluation of the 
clustering algorithms. We give a brief overview of the datasets 
as follows. 

The first dataset that we use is a synthetic dataset, where
we have three point clouds in $\mathbb{R}^2$ forming the English letters
"N", "R" and "C" (shown in Fig. 6). Each point cloud is
generated from a five-component Gaussian mixture model with
different values for the mean and variance of the Gaussian
distributions, where each component represents a class of 500
points with a specific color. A 5-nearest neighbor graph is then
constructed for each point cloud by assigning the weight of
the edges connecting two vertices (points) as the reciprocal of
the Euclidean distance between them. This gives us a 3-layer
graph of 2500 vertices, where each graph layer is from a point
cloud forming a particular letter. The goal with this dataset is
to recover the five clusters (indicated by five colors) of the
2500 vertices using the three graph layers constructed from
the three point clouds.
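The graph construction described above can be sketched as follows. The text does not say how the directed nearest-neighbor relation is made undirected, so the rule used here (keep an edge if either endpoint selects the other) is an assumption, as is the function name.

```python
import numpy as np

def knn_graph(points, n_neighbors=5):
    """Symmetric k-nearest-neighbor graph with weights 1 / Euclidean distance."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is never its own neighbor
    w = np.zeros((n, n))
    for i in range(n):
        nearest = np.argsort(dist[i])[:n_neighbors]
        # Guard against division by zero for duplicated points.
        w[i, nearest] = 1.0 / np.maximum(dist[i, nearest], 1e-12)
    return np.maximum(w, w.T)  # assumed symmetrization: union of neighbor relations
```

Applying `knn_graph` to each of the three point clouds yields three weighted adjacency matrices over the same vertex set. The dense pairwise-distance array is adequate for a few thousand points, and a spatial index would be preferable for larger clouds.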

The second dataset contains data collected during the Lausanne
Data Collection Campaign [57] by the Nokia Research
Center (NRC) in Lausanne. This dataset contains the mobile
phone data of 136 users living and working in the Lake
Leman region in Switzerland, recorded over a one-year period.
Considering the users as vertices in the graph, we construct
three graphs by measuring the proximities between these users
in terms of GPS locations, Bluetooth scanning activities and
phone communication. More specifically, for GPS locations
and Bluetooth scans, we measure how many times two users
are sufficiently close geographically (within a distance of
roughly 1 km), and how many times two users' devices have
detected the same Bluetooth devices, respectively, within 30-minute
time windows. Aggregating these results over the one-year
period leads to two weighted adjacency matrices that
represent the physical proximities of the users measured with
different modalities. In addition, an adjacency matrix for
phone communication is generated by assigning edge weights
depending on the number of calls between any pair of
users. These three adjacency matrices form a 3-layer graph of
136 vertices, where the goal is to recover the eight groundtruth
clusters that have been constructed from the users' email
affiliations.

The third dataset is a subset of the Cora bibliographic
dataset^. This dataset contains 292 research papers from
three different fields, namely, natural language processing,
data mining and robotics. Considering papers as vertices in
the graph, we construct the first two graphs by measuring
the similarities among the titles and the abstracts of these
papers. More precisely, for both title and abstract, we represent
each paper by a vector of non-trivial words using the Term
Frequency-Inverse Document Frequency (TF-IDF) weighting
scheme, and compute the cosine similarities between every
pair of vectors as the edge weights in the graphs. Moreover,
we add a third graph which reflects the citation relationships
among the papers, namely, we assign an edge with unit weight
between papers A and B if A has cited or been cited by B.
This results in a 3-layer graph of 292 vertices, and the goal
in this dataset is to recover the three clusters corresponding to
the different fields the papers belong to.

Fig. 6. Three five-class point clouds in $\mathbb{R}^2$ forming English letters "N", "R"
and "C".
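The text-similarity layers can be sketched as below. TF-IDF has several variants and the text does not state which one is used, so this sketch assumes raw term counts multiplied by log(N / document frequency), whitespace tokenization, and a tiny hypothetical stop-word list standing in for the removal of trivial words. None of these choices is taken from the original.

```python
import math
from collections import Counter

import numpy as np

# Hypothetical stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

def tfidf_cosine_graph(documents):
    """Adjacency matrix whose weights are cosine similarities of TF-IDF vectors."""
    tokenized = [[t for t in doc.lower().split() if t not in STOP_WORDS]
                 for doc in documents]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    index = {t: j for j, t in enumerate(vocab)}
    n_docs = len(documents)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    x = np.zeros((n_docs, len(vocab)))
    for i, tokens in enumerate(tokenized):
        for term, count in Counter(tokens).items():
            x[i, index[term]] = count * math.log(n_docs / doc_freq[term])
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / np.where(norms > 0, norms, 1.0)  # unit-length rows
    w = x @ x.T                               # cosine similarities
    np.fill_diagonal(w, 0.0)                  # no self-loops
    return w
```

Running the function once on the titles and once on the abstracts would give the first two layers, and the citation layer is a plain 0/1 symmetric matrix.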

To visualize the graphs in the three datasets, the spy plots
of the adjacency matrices of the graphs are shown in Fig. 7
(a), (b) and (c) for the synthetic, NRC and Cora datasets,
respectively, where the orderings of the vertices are made
consistent with the groundtruth clusters^. A spy plot is a global
view of a matrix where every non-zero entry in the matrix
is represented by a blue dot (without taking into account
the value of the entry). As shown in these figures, the clusters
are clearly visible in the synthetic and Cora datasets, while
the clusters in the NRC dataset are not very clear. The reason
for this is that, in the NRC dataset, the email affiliations used
to create the groundtruth clusters provide only approximate
information.



B. Clustering algorithms 

We now briefly explain the clustering algorithms in our
comparative performance analysis along with some implementation
details. We adopt three baseline algorithms as well
as a state-of-the-art technique, namely the co-regularization
approach introduced in [7]. As we shall see, there is an interesting
connection between this approach and the proposed algorithm.
First of all, we describe some implementation details
of the proposed SC-ML algorithm and the co-regularization
approach in [7]:

• SC-ML: Spectral Clustering on Multi-Layer graphs, as
presented in Section V. The implementation of SC-ML
is straightforward, and the only parameter
to choose is the regularization parameter $\alpha$ in Eq. (6).
In our experiments, we choose the value of $\alpha$ through
multiple empirical trials and report the best clustering
performance. Specifically, we choose $\alpha$ to be 0.64 for
the synthetic dataset and 0.44 for both real world datasets.
We will discuss the choice of this parameter later in this
section.

• SC-CoR: Spectral Clustering with Co-Regularization
proposed in [7]. We follow the same practice as in [7]
to choose the most informative graph layer to initialize
the alternating optimization scheme in SC-CoR. The
stopping criterion for the optimization process is chosen
such that the optimization stops when changes in the
objective function are smaller than 10~^. Similarly, we
choose the value of the regularization parameter $\alpha$ in SC-CoR
through multiple empirical trials and report the best
clustering performance. As in [7], the parameter $\alpha$ is fixed
in the optimization steps for all graph layers.

^Available online at "http://people.cs.umass.edu/~mccallum/data.html" under category "Cora Research Paper Classification".

^The adjacency matrix for GPS proximity in the NRC dataset is thresholded
for better illustration.

Next, we introduce three baseline comparative algorithms 
that work as follows: 

• SC-Single: Spectral Clustering (Algorithm 1) applied on
a single graph layer, where the graph is chosen to be the
one that leads to the best clustering results.

• SC-Sum: Spectral clustering applied on a global matrix
W that is the summation of the normalized adjacency
matrices of the individual layers:

$$W = \sum_{i=1}^{M} D_i^{-1/2} W_i D_i^{-1/2}. \tag{14}$$

• SC-KSum: Spectral clustering applied on the summation
K of the spectral kernels 1^1 of the adjacency matrices:

$$K = \sum_{i=1}^{M} K_i \quad \text{with} \quad K_i = \sum_{m=1}^{d} u_{im} u_{im}', \tag{15}$$

where n is the number of vertices, $d \ll n$ is the number of
eigenvectors used in the definition of the spectral kernels
$K_i$, and $u_{im}$ represents the m-th eigenvector of the
Laplacian $L_i$ for graph $G_i$. To make it more comparable
with spectral clustering, we choose d to be the target
number of clusters in our experiments.
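The two summation baselines of Eqs. (14) and (15) can be sketched as follows. The sketch assumes that $D_i$ is the diagonal degree matrix of layer i, that the Laplacian is the symmetric normalized one, and that the d eigenvectors in each spectral kernel are those of the d smallest eigenvalues; the function names are hypothetical.

```python
import numpy as np

def normalized_adjacency(w):
    """D^{-1/2} W D^{-1/2} for a symmetric non-negative adjacency matrix w."""
    deg = w.sum(axis=1)
    inv_sqrt = np.zeros_like(deg, dtype=float)
    inv_sqrt[deg > 0] = 1.0 / np.sqrt(deg[deg > 0])
    return inv_sqrt[:, None] * w * inv_sqrt[None, :]

def sc_sum_matrix(adjacencies):
    """Eq. (14): sum of the normalized adjacency matrices over the layers."""
    return sum(normalized_adjacency(np.asarray(w, dtype=float)) for w in adjacencies)

def sc_ksum_matrix(adjacencies, d):
    """Eq. (15): sum over the layers of rank-d spectral kernels."""
    total = 0.0
    for w in adjacencies:
        w = np.asarray(w, dtype=float)
        lap = np.eye(w.shape[0]) - normalized_adjacency(w)
        _, vecs = np.linalg.eigh(lap)  # ascending eigenvalues
        u = vecs[:, :d]                # assumed: eigenvectors of the d smallest eigenvalues
        total = total + u @ u.T        # K_i = sum_m u_im u_im'
    return total
```

Both functions return a single n × n matrix that a standard single-graph spectral clustering routine can then consume, which is why these baselines behave like clustering on an averaged graph.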

C. Results and discussions 

We evaluate the performance of the different clustering
algorithms with three different criteria, namely Purity, Normalized
Mutual Information (NMI) and Rand Index (RI) [56].
The results are summarized in Table III (a), (b) and (c) for
the synthetic, NRC and Cora datasets, respectively. For each
scenario, the best two results are marked with an asterisk (*).
First, as expected, we see that the clustering performances
for the synthetic and Cora datasets are higher than that
for the NRC dataset, which indicates that the latter one is
indeed more challenging due to the approximate groundtruth
information. Second, it is clear that SC-ML and SC-CoR
generally outperform the baseline approaches for the three
datasets. More specifically, although both SC-Sum and SC-KSum
indeed improve the clustering quality compared to
clustering with individual graph layers, they only provide
limited improvement, and the potential drawback for both of
the summation methods is that they can be considered as
similar to building a simple average graph for representing
the different layers of information. Therefore, depending on
data characteristics in specific datasets, this might smooth out
the particular information provided by individual layers, and
thus penalize the clustering performance. In comparison, SC-ML
and SC-CoR always achieve significant improvements in
the clustering quality compared to clustering using individual
graph layers.

TABLE III
Performance comparison of different clustering algorithms on (a) the synthetic dataset, (b) the NRC dataset, and (c) the Cora dataset.

(a)
          SC-Single   SC-Sum     SC-KSum    SC-CoR     SC-ML
Purity    0.8580      0.9752     0.9768     0.9784*    0.9828*
NMI       0.7266      0.9224     0.9262     0.9278*    0.9407*
RI        0.9018      0.9806     0.9818     0.9830*    0.9864*

(b)
          SC-Single   SC-Sum     SC-KSum    SC-CoR     SC-ML
Purity    0.5147      0.5956*    0.5294     0.5809     0.6103*
NMI       0.3133      0.3988     0.3440     0.4056*    0.4156*
RI        0.7326      0.7852     0.7667     0.7878*    0.7929*

(c)
          SC-Single   SC-Sum     SC-KSum    SC-CoR     SC-ML
Purity    0.9555      0.9795     0.9726     0.9829*    0.9829*
NMI       0.8314      0.9062     0.8863     0.9175*    0.9175*
RI        0.9426      0.9731     0.9645     0.9775*    0.9775*
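For reference, the three evaluation criteria can be computed from a contingency table of groundtruth classes against predicted clusters. The sketch below is illustrative: NMI has several normalizations in the literature, and the arithmetic mean of the two entropies used here is an assumption about which one applies. The function names are hypothetical.

```python
import numpy as np

def contingency(labels_true, labels_pred):
    """Table whose entry (a, b) counts items of true class a placed in cluster b."""
    _, t = np.unique(np.asarray(labels_true).ravel(), return_inverse=True)
    _, p = np.unique(np.asarray(labels_pred).ravel(), return_inverse=True)
    table = np.zeros((t.max() + 1, p.max() + 1), dtype=np.int64)
    np.add.at(table, (t, p), 1)
    return table

def purity(labels_true, labels_pred):
    """Fraction of items that belong to the majority class of their cluster."""
    table = contingency(labels_true, labels_pred)
    return table.max(axis=0).sum() / table.sum()

def nmi(labels_true, labels_pred):
    """Mutual information divided by the mean of the two entropies (assumed variant)."""
    pij = contingency(labels_true, labels_pred).astype(float)
    pij /= pij.sum()
    pi, pj = pij.sum(axis=1), pij.sum(axis=0)
    nz = pij > 0
    mutual = np.sum(pij[nz] * np.log(pij[nz] / np.outer(pi, pj)[nz]))
    h_true = -np.sum(pi[pi > 0] * np.log(pi[pi > 0]))
    h_pred = -np.sum(pj[pj > 0] * np.log(pj[pj > 0]))
    denom = 0.5 * (h_true + h_pred)
    return mutual / denom if denom > 0 else 1.0

def rand_index(labels_true, labels_pred):
    """Fraction of item pairs on which the two partitions agree."""
    table = contingency(labels_true, labels_pred)
    n = table.sum()
    rows, cols = table.sum(axis=1), table.sum(axis=0)
    same_both = (table * (table - 1) // 2).sum()
    same_true = (rows * (rows - 1) // 2).sum()
    same_pred = (cols * (cols - 1) // 2).sum()
    total = n * (n - 1) // 2
    return (total + 2 * same_both - same_true - same_pred) / total
```

All three scores are invariant to a relabeling of the clusters, which is what makes them suitable for comparing a clustering against groundtruth classes.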

We now take a closer look at the comparisons between 
SC-ML and SC-CoR. Although the latter is not developed 
from the viewpoint of subspace analysis on the Grassmann 
manifold, it can actually be interpreted as a process in which 
individual subspace representations are updated based on the 
same distance analysis as in our framework. In this sense, 
SC-CoR uses the same distance as ours to measure similarities
between subspaces. The merging solution, however,
leads to a different optimization problem than that of Eq.
(6), which is based on a slightly different merging philosophy.
Specifically, it enforces the information contained in
the individual subspace representations to be consistent with 
each other. An alternating optimization scheme optimizes, 
at each step, one subspace representation, while fixing the 
others. This can be interpreted as a process in which one 
subspace at each step becomes closer to other subspaces in
terms of the projection distance on the Grassmann manifold.
Upon convergence, all initial subspaces are "brought" closer 



to each other and the final subspace representation from the 
most informative graph layer is considered as the one that 
combines information from all the graph layers efficiently. Two 
illustrations of SC-CoR and SC-ML are shown in Fig. 8(a)
and (b), respectively. Therefore, on the one hand, results for
both approaches demonstrate the benefit of using our distance 
analysis on the Grassmann manifold for merging information 
in multi-layer graphs. Indeed, for both approaches, since the 
distances between the solutions and the individual subspaces 
are minimized without sacrificing too much of the information 
from individual graph layers, the resulting combinations can 
be considered as good summarizations of the multiple graph 
layers. On the other hand, however, SC-ML differs from SC- 
CoR mainly in the following aspects. First, the alternating 
optimization scheme in SC-CoR focuses only on optimizing 
one subspace representation at each step, and it requires a 
sensible initialization to guarantee that the algorithm ends up 
at a good local minimum for the optimization problem; it 
also does not guarantee that all the subspace representations 
converge to one point on the Grassmann manifold (it uses 
the final update of the most informative layer for clustering)^.
In contrast, SC-ML directly finds a single representation 
through a unique optimization of the representative subspace 
with respect to all graph layers jointly, which does not need 
alternating optimization steps and careful initializations. These 
are the possible reasons that explain why SC-ML performs
better than SC-CoR in our experiments, as we can see in
Table III. Second, it is worth noting that, from a computational
point of view, the optimization process involved in SC-ML is
much simpler than that in SC-CoR. Specifically, the iterative
nature of SC-CoR requires solving an eigenvalue problem
MN times, where M and N are the number of individual
graphs and the number of iterations needed for the algorithm
to converge, respectively. In contrast, since SC-ML aims at
finding a globally representative subspace without modifying
the individual ones, it needs to solve an eigenvalue problem
only once.

Fig. 8. Illustrations of graph layer merging. (a) Co-regularization [7]: iterative update of the individual subspace representations. The upper index [N]
represents the number of iterative steps on each individual subspace representation. The final update of the subspace representation for the most informative
graph (shown as a star) is considered as a good combination; (b) Proposed merging framework: the representative subspace (U, shown as a star) is
found in one step.

Fig. 9. Performances of SC-ML and SC-CoR under different values of parameter $\alpha$ in the corresponding implementations.

Finally, we discuss the influence of the choice of the
regularization parameter $\alpha$ on the performance of SC-ML.
In Fig. 9, we compare the performances of SC-ML and SC-CoR
in terms of NMI under different values of the parameter $\alpha$
in the corresponding implementations. As we can see, in our
experiments, SC-ML achieves the best performances when $\alpha$
is chosen between 0.4 and 0.6, and it outperforms SC-CoR for
a large range of $\alpha$ for the synthetic and NRC datasets. For the
Cora dataset, the two algorithms achieve the same performance
at different values of $\alpha$, but SC-ML permits a larger range of
parameter selection. Furthermore, it is worth noting that the
optimal values for $\alpha$ in SC-ML lie in similar ranges across
different datasets, thanks to the adoption of the normalized
graph Laplacian matrix whose spectral norm is upper bounded
by 2. In summary, this shows that the performance of SC-ML
is reasonably stable with respect to the parameter selection.
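The bound on the spectral norm mentioned above is easy to check numerically. The sketch below assumes the symmetric normalized Laplacian $I - D^{-1/2} W D^{-1/2}$ and uses a hypothetical helper name. It also illustrates that rescaling all edge weights leaves the spectrum unchanged, which is one plausible reason why similar values of $\alpha$ work across datasets with very different weight scales.

```python
import numpy as np

def normalized_laplacian_spectrum(w):
    """Eigenvalues (ascending) of I - D^{-1/2} W D^{-1/2} for a symmetric w >= 0."""
    w = np.asarray(w, dtype=float)
    deg = w.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = 1.0 / np.sqrt(deg[deg > 0])
    lap = np.eye(len(w)) - inv_sqrt[:, None] * w * inv_sqrt[None, :]
    return np.linalg.eigvalsh(lap)
```

For any symmetric non-negative adjacency matrix the returned eigenvalues lie in the interval [0, 2], while each projection matrix $U_i U_i'$ has eigenvalues in {0, 1}, so the two kinds of terms combined through $\alpha$ live on comparable scales.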

^In [7], the authors have also proposed a "centroid-based co-regularization
approach" that introduces a consensus representation. However, such a representation
is still computed via an alternating optimization scheme, which
needs a sensible initialization and keeps the same iterative nature.

VII. Conclusions

In this paper, we provide a framework for analyzing information
provided by multi-layer graphs and for clustering
vertices of graphs in rich datasets. Our generic approach
is based on the transformation of information contained in
the individual graph layers into subspaces on the Grassmann
manifold. The estimation of a representative subspace can then
be essentially considered as the problem of finding a good
summarization of multiple subspaces using distance analysis
on the Grassmann manifold. The proposed approach can be
applied to various learning tasks where multiple subspace
representations are involved. Under appropriate and realistic
assumptions, we show that our framework can be applied
to the clustering problem on multi-layer graphs and that it
provides an efficient solution that is competitive with
state-of-the-art techniques. Finally, we mention the following research
directions as interesting and open problems. First, the subspace
representation inspired by spectral clustering is not the only
valid representation for the graph information. As suggested
by the works in [10], [11], the eigenvectors of the modularity
matrix of the graph can also be used as a low dimensional
subspace representation for the information contained in the
graph. Therefore, an interesting problem is to find the most
appropriate subspace representation for the data available,
whether they are graphs or of some more general form. Second,
we believe that better clustering performance can be achieved
if prior information on the data is available, in particular about
the consistency of the information in the different graph layers.
These problems are, however, left for future studies.